Information theory and the framework of information dynamics have been used to provide tools
to characterise complex systems. In particular, we are interested in quantifying information
storage, information modification and information transfer as characteristic elements of
computation. Although these quantities are defined for autonomous dynamical systems, information
dynamics can also help to get a "wholistic" understanding of input-driven systems such as neural
networks. In this case, we do not distinguish between the system itself and the effects the input
has on the system. This may be desired in some cases, but it will change the questions we are able
to answer, and is consequently an important consideration, for example, for biological systems
which perform non-trivial computations and also retain a short-term memory of past inputs. Many
other real-world systems like cortical networks are also heavily input-driven, and the application
of tools designed for autonomous dynamical systems may not necessarily lead to intuitively
interpretable results.

The aim of our work is to extend the measurements used in the information dynamics framework
to input-driven systems. Using the proposed input-corrected information storage we hope to better
quantify system behaviour. This will be important for heavily input-driven systems like artificial
neural networks, in order to abstract from specific benchmarks, and for brain networks, where
intervention is difficult and individual components cannot be tested in isolation or with
arbitrary input data.



I. INTRODUCTION 

In his 1990 paper [1], Langton addresses the question of
under what conditions physical systems support the basic
operations of information transmission, information storage,
and information modification needed for computation. In this
investigation, cellular automata (CAs) are used as a formal
abstraction of physical systems. Using a parameterisation of
possible CA rules, a qualitative survey of the different
dynamical regimes is presented, along with the observation
that CAs exhibiting the most complex behavior are, in general,
found near the phase transition between highly ordered and
highly disordered dynamics. Information theory and the
framework of information dynamics [2-4] then provide the tools
to quantify these elements of computation, i.e., information
transmission, storage, and modification, in complex systems.
These information-theoretic tools provide the means to
understand, and eventually to engineer, dynamical systems, a
task for which a proper understanding of their computational
properties is required. In contrast to static measurements of,
e.g., the entropy of a system at a given time, they focus on
dynamical aspects of information processing. Understanding
these dynamical aspects is critical, and it has been suggested
that "the main challenge is understanding the dynamics of the
propagation of information ... in networks, and how these
networks process such information." [5].

Systems like CAs are autonomous dynamical systems: the
evolution of their states at any given moment depends on a
state-transition function and the current state. When
dynamical systems are instead driven by some external input,
the available tools may not be suitable to fully characterize
them, and may not lead to intuitively interpretable results:
in this case, the information dynamics framework cited above,
though useful to get a "wholistic" understanding of complex
systems together with their input, will not necessarily
provide useful information about the system in isolation. This
is an important observation, as, for example, biological
systems perform non-trivial computations and also retain a
short-term memory of past inputs [6]: using the information
dynamics framework, there is no distinction between the
structure of the input into the system and that of the system
itself. In some of our work, we have measured information
transfer and active information storage in recurrent neural
networks to show peak performance near the edge of chaos [7]
for a number of inputs. Many real-world systems like cortical
networks are non-autonomous dynamical systems, and heavily
input-driven. This requires new ways of investigating these
systems, in particular if inputs are expected to change over
time, and we are interested in their properties in the face of
change.

We are not the first to identify the need for new ways of
analyzing non-autonomous dynamical systems: the work of
Manjunath et al. [8] points out new developments in this area.
Their focus is on attractors, and how the concept translates
to input-driven systems. Speaking about an example case, they
ask "Where does the perceived complexity of the state
evolution come from? Is it due to the complex nature of the
input driving source, or due to the complex autonomous
dynamics of the individual maps [...], or both? Theory of
autonomous systems, while profound and deep in many respects,
is not suitable for answering such questions." [8].

In this paper, we attempt to provide the theory necessary to
answer some of these questions for input-driven systems,
starting from a basic concept that is used to quantify
information storage in autonomous dynamical systems, the
active information storage [9]. We extend this concept to the
non-autonomous case, and illustrate with simple examples that
the computed quantities match the intuition of information
storage.



II. ACTIVE INFORMATION STORAGE 

Active information storage, like information theory in
general, has been shown to be useful for analyzing complex
systems, and shares with it the advantage of being domain
independent by using (Shannon) entropy as the fundamental
quantity upon which it is based. Before we give a definition
of active information storage, we start with a few
information-theoretical preliminaries. Entropy represents the
uncertainty associated with any measurement $x$ of a random
variable $X$, $H(X) = -\sum_x p(x) \log p(x)$ (where we use 2
as the base of the logarithm, and bits as the unit of
entropy). The conditional entropy of $X$ given $Y$ quantifies
the amount of information needed to describe the outcome $x$
given that the value of $y$ is known:
$H(X|Y) = -\sum_x \sum_y p(x,y) \log p(x|y)$. The mutual
information between $X$ and $Y$ measures the average reduction
in uncertainty about $x$ that results from learning the value
of $y$, or vice versa, and can be expressed via conditional
entropies:

$$ I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{1} $$

The conditional mutual information between X and Y 
given Z is the mutual information between X and Y when 
Z is known: 



$$ I(X; Y|Z) = H(X|Z) - H(X|Y, Z). \tag{2} $$
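
As a minimal sketch of Eqs. (1) and (2), the following Python snippet computes entropy, conditional entropy, mutual information and conditional mutual information from an explicit joint probability table. The helper names and the toy XOR distribution are illustrative assumptions and not part of the framework; conditional entropy is obtained here through the chain rule $H(X|Y) = H(X,Y) - H(Y)$, which is equivalent to the definition given above.

```python
from collections import defaultdict
from math import log2


def marginal(joint, axes):
    """Sum a joint pmf {(x, y, z): p} down to the variables at positions `axes`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return dict(out)


def entropy(pmf):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def cond_entropy(joint, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(marginal(joint, target + given)) - entropy(marginal(joint, given))


def mutual_info(joint, a, b):
    """Eq. (1): I(A; B) = H(A) - H(A | B)."""
    return entropy(marginal(joint, a)) - cond_entropy(joint, a, b)


def cond_mutual_info(joint, a, b, c):
    """Eq. (2): I(A; B | C) = H(A | C) - H(A | B, C)."""
    return cond_entropy(joint, a, c) - cond_entropy(joint, a, b + c)


# Toy joint pmf over (x, y, z): y and z are fair independent bits, x = y XOR z.
xor_joint = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

h_x = entropy(marginal(xor_joint, (0,)))                        # H(X)
i_xy = mutual_info(xor_joint, (0,), (1,))                       # I(X; Y)
i_xy_given_z = cond_mutual_info(xor_joint, (0,), (1,), (2,))    # I(X; Y | Z)
```

For this toy distribution, `h_x` is one bit, `i_xy` is zero, and `i_xy_given_z` is one bit: $y$ alone says nothing about $x$, but determines it completely once $z$ is known.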



The concept of active information storage is derived [9] as
the information in the past of an agent, process or variable
that can be used to predict its future. In contrast to excess
entropy, which measures the total stored information that is
used at some point in the future of the state process of an
agent, the active information storage $A(X)$ expresses how
much of the stored information is actually in use at the next
time step when the next process value is computed. $A(X)$ is
expressed as the mutual information between the semi-infinite
past of the process $X$ and its next state $X'$, with
$X^{(k)}$ denoting the last $k$ states of that process:



$$ A(X) = \lim_{k \to \infty} A(X, k) \tag{3} $$

$$ A(X, k) = I(X^{(k)}; X') \tag{4} $$

Eq. (4) is also used to represent $k$-finite approximations
of active information storage.

Active information storage is the average amount of
information in the past of a process that is in use to predict
the next step, i.e., the expected value of the local active
information storage at each time step $n+1$. For a random
variable $X$, the local active information storage for the
value $x_{n+1}$ at time step $n+1$ is:

$$ a_X(n+1) = \lim_{k \to \infty} a_X(n+1, k), \tag{5} $$

$$ a_X(n+1, k) = \log \frac{p(x_n^{(k)}, x_{n+1})}{p(x_n^{(k)}) \, p(x_{n+1})}. \tag{6} $$



In a system of processes $X$, the local active information
storage for the value $x_{i,n+1}$ at time step $n+1$ of a
process $i$ is defined as:

$$ a_X(i, n+1) = a_{X_i}(n+1), \tag{7} $$

$$ a_X(i, n+1, k) = a_{X_i}(n+1, k). \tag{8} $$



Since active information storage is the average of its local
values, we can write:

$$ A(X) = \langle a_X(n+1) \rangle_n, \tag{9} $$

$$ A(X, k) = \langle a_X(n+1, k) \rangle_n. \tag{10} $$
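
The local values of Eq. (6) and their average in Eq. (10) can be sketched as follows. This is an illustrative plug-in estimator that replaces all probabilities by relative frequencies observed in a single time series; the function names and the two test sequences are assumptions made for this example, and other estimators could be used instead.

```python
from collections import Counter
from math import log2


def local_ais(series, k):
    """Plug-in estimate of the local values a_X(n+1, k) of Eq. (6).

    Probabilities are relative frequencies of (past, next) pairs observed
    in `series`; this estimator choice is an assumption of the sketch.
    """
    pairs = [(tuple(series[n - k:n]), series[n]) for n in range(k, len(series))]
    total = len(pairs)
    c_joint = Counter(pairs)
    c_past = Counter(past for past, _ in pairs)
    c_next = Counter(nxt for _, nxt in pairs)
    return [
        log2(c_joint[(past, nxt)] * total / (c_past[past] * c_next[nxt]))
        for past, nxt in pairs
    ]


def ais(series, k):
    """Eq. (10): A(X, k) as the average of the local values."""
    values = local_ais(series, k)
    return sum(values) / len(values)


alternating = [0, 1] * 500   # next value fully determined by the previous one
constant = [0] * 1000        # next value has zero entropy, nothing to predict

ais_alternating = ais(alternating, k=1)   # close to 1 bit
ais_constant = ais(constant, k=1)         # 0 bits
```

The alternating sequence yields close to one bit because a single past value removes all uncertainty about the next one, while the constant sequence yields zero because there is no uncertainty to remove.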



For sets of homogeneous processes we can also average over
all processes, i.e.,

$$ A(X, k) = \langle a_X(i, n+1, k) \rangle_{i,n}. \tag{11} $$



For details on the derivation and an in-depth discussion
of active information storage and its properties we refer
to [9].



III. ACTIVE INFORMATION STORAGE 
APPLIED TO INPUT-DRIVEN SYSTEMS 

To illustrate the effect of quantifying active information
storage in an input-driven system, we look at two simple
cases. The first case (Fig. 1a) is a simple forwarding unit,
for which the output at step $n$ is the same as the input. In
the second case (Fig. 1b), the unit keeps its last output as
an internal state. Its output is computed as the logical XOR
of the input and the internal state.

[Figure 1, panel a): forwarding unit with input 0101101100001 and output (= input) 0101101100001. Panel b): XOR unit with input 0101101100001 and output 1011011000010.]

FIG. 1. Simple computational units of artificial neural
networks may forward or store information. In a), inputs are
just forwarded to the output; b) implements an XOR neuron that
stores the last state to compute the output.



Intuitively, in the first case we would expect zero active
information storage for the unit, since no information is
stored in the system. As we shall see, the computed active
information storage will in fact depend on the structure of
the input data. Similarly, in the second case, we would expect
one bit of active information storage, since the unit's last
state is required to compute its output. Again, we will see
that the computed active information storage depends strongly
on the structure of the input data.

To demonstrate this effect we look at two specific kinds of
input data, $u_1$ and $u_2$. For $u_1$ we draw values 0 and 1
independently from a Bernoulli distribution with $p = 0.5$.
For $u_2$, we also draw binary random values, but impose a
Markov condition so that with a probability of 0.7 the last
value is repeated, and with a probability of 0.3 the value is
changed from 0 to 1 or vice versa.

Using these two time series to drive the forwarding unit, the
probability of specific output values will be
$p(x_n = 0) = p(x_n = 1) = 0.5$ in both cases, but the joint
probabilities of two subsequent values will be different: for
$u_1$, $p(x_n, x_{n+1}) = 0.25$ for every pair of values, but
for $u_2$, $p(x_n = x_{n+1}) = 0.7$ and
$p(x_n \neq x_{n+1}) = 0.3$, i.e., 0.35 for each of the two
pairs of equal values and 0.15 for each of the two pairs of
different values.

For a finite-size approximation of active information storage
with $k = 1$, the active information storage can be computed
in both cases from the known (joint) probabilities (cf. Eq.
10). As expected, it evaluates to $A(X,1) = 0$ in the case of
the i.i.d. input from $u_1$, since
$\log \frac{0.25}{0.5 \cdot 0.5} = 0$. In the case of
structured input from $u_2$, it evaluates to
$A(X,1) \approx 0.1$, with
$A(X,1) = 0.7 \log \frac{0.35}{0.5 \cdot 0.5} + 0.3 \log \frac{0.15}{0.5 \cdot 0.5} \approx 0.119$.
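
The two values for the forwarding unit can be checked directly from the joint probabilities of successive outputs. The snippet below is a small sketch of that arithmetic; the function name and the dictionary layout are assumptions made for illustration, and the joint tables follow from the stated input statistics (0.25 per pair for $u_1$; 0.35 per equal pair and 0.15 per unequal pair for $u_2$).

```python
from math import log2


def ais_k1_from_joint(joint):
    """A(X, 1) = I(x_n; x_{n+1}) in bits from a joint pmf {(x_n, x_next): p}."""
    p_prev = {}
    p_next = {}
    for (a, b), p in joint.items():
        p_prev[a] = p_prev.get(a, 0.0) + p
        p_next[b] = p_next.get(b, 0.0) + p
    return sum(
        p * log2(p / (p_prev[a] * p_next[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


# u_1: i.i.d. fair bits, every pair of successive values has probability 0.25.
joint_u1 = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

# u_2: a value is repeated with probability 0.7 and flipped with probability 0.3,
# so each "equal" pair has probability 0.35 and each "unequal" pair 0.15.
joint_u2 = {(a, b): (0.35 if a == b else 0.15) for a in (0, 1) for b in (0, 1)}

ais_u1 = ais_k1_from_joint(joint_u1)   # 0 bits
ais_u2 = ais_k1_from_joint(joint_u2)   # about 0.119 bits
```

Although the forwarding unit stores nothing, the structured input alone produces a non-zero value, which is the effect described above.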

In the case of the XOR unit, again using independent input
data $u_1$, an output of 0 or 1 is equally likely, independent
of the current input: $p(x_n, x_{n+1}) = 0.25$. The computed
active information storage for a history size of $k = 1$ will
be zero. This is clearly counter-intuitive, since the unit
actually stores one bit of information that is required to
compute its output.
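
A short simulation can make this observation concrete. The sketch below drives a hypothetical `xor_unit` with i.i.d. fair bits and estimates $A(X,1)$ from relative frequencies; the seed, the sample size and the function names are arbitrary choices for this illustration.

```python
import random
from collections import Counter
from math import log2


def xor_unit(inputs, state=0):
    """Output x_{n+1} = u_{n+1} XOR x_n; the last output is the internal state."""
    outputs = []
    for u in inputs:
        state ^= u
        outputs.append(state)
    return outputs


def ais_k1(series):
    """Plug-in estimate of A(X, 1) = I(x_n; x_{n+1}) in bits."""
    pairs = list(zip(series[:-1], series[1:]))
    total = len(pairs)
    c_joint = Counter(pairs)
    c_prev = Counter(a for a, _ in pairs)
    c_next = Counter(b for _, b in pairs)
    return sum(
        c / total * log2(c * total / (c_prev[a] * c_next[b]))
        for (a, b), c in c_joint.items()
    )


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
u1 = [rng.randint(0, 1) for _ in range(100_000)]
x = xor_unit(u1)
ais_xor_u1 = ais_k1(x)  # close to zero although the unit stores one bit
```

The estimate stays near zero because, with i.i.d. fair input, the previous output alone carries no information about the next output.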

With increasing history sizes $k$, the computed values will
eventually approximate the intuitively correct values of 0 and
1, respectively. Large history sizes, however, require large
amounts of data to estimate the involved joint probabilities
$p(x_n^{(k)}, x_{n+1})$. Oftentimes, the data required to
produce reliable estimates are simply not available. With
larger $k$ and larger data sets, estimation of
$p(x_n^{(k)}, x_{n+1})$ also becomes more expensive. We aim to
provide a solution using a new quantity that corrects the
$k$-finite approximation of active information storage for
input-driven systems.



IV. ACTIVE INFORMATION STORAGE FOR 
INPUT-DRIVEN SYSTEMS 



To correctly estimate active information storage for 
input-driven systems, we propose to condition out the 
input into the system. The local input-corrected active 
information storage at time step n + 1 for a process X 
with input U thus becomes: 



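$$ a^U_X(n+1) = \lim_{k \to \infty} a^U_X(n+1, k) \tag{12} $$

$$ a^U_X(n+1, k) = \log \frac{p(x_n^{(k)}, x_{n+1} | u_{n+1})}{p(x_n^{(k)} | u_{n+1}) \, p(x_{n+1} | u_{n+1})} \tag{13} $$

$$ = \log \frac{p(x_{n+1} | x_n^{(k)}, u_{n+1})}{p(x_{n+1} | u_{n+1})} \tag{14} $$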



This measure can again be generalised to processes $X_i$
in a system $X$:

$$ a^U_X(i, n+1) = \lim_{k \to \infty} a^U_X(i, n+1, k) \tag{15} $$

$$ a^U_X(i, n+1, k) = a^U_{X_i}(n+1, k) \tag{16} $$

$$ = \log \frac{p(x_{i,n+1} | x_{i,n}^{(k)}, u_{n+1})}{p(x_{i,n+1} | u_{n+1})} \tag{17} $$



We then have the input-corrected active information storage
$A^U_X(i, k) = \langle a^U_X(i, n, k) \rangle_n$. For
homogeneous processes we can again average over these,
resulting in:

$$ A^U_X(k) = \langle a^U_X(i, n, k) \rangle_{i,n}. \tag{18} $$



Applying the measure to our two example cases from above, we
compute the respective conditional probabilities, again using
a history size of $k = 1$. In the case of the forwarding unit,
both local conditional probabilities
$p(x_{n+1} | x_n^{(k)}, u_{n+1})$ and $p(x_{n+1} | u_{n+1})$
evaluate to 1 for both the independent uniform input $u_1$ and
the structured input $u_2$, i.e., the input-corrected active
information storage will be $\log 1 = 0$, independent of the
input, as we would expect.

In the case of the XOR unit, conditioning on $u_{n+1}$ and
$x_n^{(k)}$ leads to a probability of 1 for
$p(x_{n+1} | x_n^{(k)}, u_{n+1})$, while
$p(x_{n+1} | u_{n+1}) = 0.5$ because of missing information
about $x_n^{(k)}$. With these probabilities, the
input-corrected active information storage evaluates to
$\log \frac{1}{0.5} = 1$ for our second example, again
independent of the input and exactly as we would expect.
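
The two example cases can also be checked by simulation. The following sketch estimates the input-corrected measure by averaging the local ratio $p(x_{n+1} | x_n^{(k)}, u_{n+1}) / p(x_{n+1} | u_{n+1})$ over a time series, with probabilities replaced by relative frequencies. The function names, seed and sample sizes are assumptions made for this illustration, not a reference implementation.

```python
import random
from collections import Counter
from math import log2


def icais(x, u, k=1):
    """Plug-in estimate of the input-corrected AIS for history size k.

    x[n] is the output and u[n] the input at step n; probabilities are
    relative frequencies, which is an assumption of this sketch.
    """
    samples = [(x[n], tuple(x[n - k:n]), u[n]) for n in range(k, len(x))]
    c_full = Counter(samples)                                   # (x_next, past, u_next)
    c_past_u = Counter((past, un) for _, past, un in samples)   # (past, u_next)
    c_next_u = Counter((xn, un) for xn, _, un in samples)       # (x_next, u_next)
    c_u = Counter(un for _, _, un in samples)                   # u_next
    total = 0.0
    for (xn, past, un), c in c_full.items():
        p_given_past_u = c / c_past_u[(past, un)]    # p(x_next | past, u_next)
        p_given_u = c_next_u[(xn, un)] / c_u[un]     # p(x_next | u_next)
        total += c * log2(p_given_past_u / p_given_u)
    return total / len(samples)


def markov_bits(n, p_repeat, rng):
    """Binary sequence that repeats its last value with probability p_repeat."""
    bits = [rng.randint(0, 1)]
    for _ in range(n - 1):
        bits.append(bits[-1] if rng.random() < p_repeat else 1 - bits[-1])
    return bits


def xor_unit(inputs, state=0):
    """Output x_{n+1} = u_{n+1} XOR x_n; the last output is the internal state."""
    outputs = []
    for value in inputs:
        state ^= value
        outputs.append(state)
    return outputs


rng = random.Random(1)  # arbitrary fixed seed
u1 = [rng.randint(0, 1) for _ in range(50_000)]   # i.i.d. input
u2 = markov_bits(50_000, 0.7, rng)                # structured input

results = {
    ("forward", "u1"): icais(u1, u1),             # forwarding unit: x = u
    ("forward", "u2"): icais(u2, u2),
    ("xor", "u1"): icais(xor_unit(u1), u1),
    ("xor", "u2"): icais(xor_unit(u2), u2),
}
```

In this sketch the forwarding unit gives zero for both inputs and the XOR unit gives approximately one bit for both inputs, in line with the values derived above.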



V. RELATION OF ICAIS TO OTHER
MEASURES

The input-corrected active information storage (icAIS) can be
related to and expressed in terms of a number of other
measures [10-13].



A. Partial Information Decomposition 

Partial Information Decomposition (PID) is a recent framework
[12] that decomposes the information that several sources
provide about a destination into the information-theoretically
atomic concepts of redundant, unique and synergistic
information. In the simplest case, for a system with three
variables $S$, $R_1$, $R_2$, we want to know how much
information $R_1$ and $R_2$ provide about $S$. It is possible
to say how much $R_1$ and $R_2$ jointly contribute to the
total information by using the mutual information
$I(S; R_1, R_2)$. Decomposing this joint information, the
amount of information that $R_1$ individually contributes
(that is not found in $R_2$), or vice versa, is the unique
information. Information that is both in $R_1$ and in $R_2$ is
called redundant information. The third concept, synergistic
information, describes the situation when neither $R_1$ nor
$R_2$ alone provides information about $S$, but only jointly
do so. Figure 2 visualizes the PID for the 3-variable case.
The concept is not limited to 3 variables and can be applied
to more complicated systems with any number of sources
$\{R_1, \ldots, R_n\}$. As nicely explained in [13], it is
defined in terms of an abstract method (in the form of axioms
that need to be satisfied), which needs an instantiation in
the form of a concrete measure.
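
The notion of synergy can be illustrated with the XOR relation $S = R_1 \oplus R_2$ for fair, independent sources, a commonly used example of purely synergistic information. The sketch below only computes ordinary mutual informations; it is not a PID measure, which would additionally require a concrete redundancy measure as noted above. All names are hypothetical and chosen for this example.

```python
from collections import defaultdict
from math import log2


def mutual_info(samples, a, b):
    """I(A; B) in bits for equally likely outcomes listed in `samples`.

    `a` and `b` are tuples of positions selecting the variables of each outcome.
    """
    p = 1.0 / len(samples)
    p_ab, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for s in samples:
        va = tuple(s[i] for i in a)
        vb = tuple(s[i] for i in b)
        p_ab[(va, vb)] += p
        p_a[va] += p
        p_b[vb] += p
    return sum(q * log2(q / (p_a[va] * p_b[vb])) for (va, vb), q in p_ab.items())


# Outcomes (s, r1, r2) with fair independent sources and S = R1 XOR R2.
outcomes = [(r1 ^ r2, r1, r2) for r1 in (0, 1) for r2 in (0, 1)]

i_s_r1 = mutual_info(outcomes, (0,), (1,))        # R1 alone: 0 bits
i_s_r2 = mutual_info(outcomes, (0,), (2,))        # R2 alone: 0 bits
i_s_r1r2 = mutual_info(outcomes, (0,), (1, 2))    # both sources jointly: 1 bit
```

Neither source alone provides information about $S$, yet together they determine it completely, so the full bit of $I(S; R_1, R_2)$ can only be attributed to their joint contribution.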

[Figure 2: diagram of the decomposition of $I(S; R_1, R_2)$.]

FIG. 2. Partial Information Decomposition for 3 variables.



B. Interaction Information 

Interaction information [10], or co-information [11], is a
generalization of mutual information developed by McGill and
Bell, respectively. It describes the information shared by $k$
random variables, which can be positive or negative. The part
of interest here is the information shared between all three
variables, $I(X;Y;Z)$.

FIG. 3. Venn diagram that visualizes the interaction
information between three variables in case of redundancy
(left) and synergy (right). [14]

Here we want to show how this idea is related to icAIS.
Interaction information for three variables is defined as
follows:



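$$
\begin{aligned}
I(X;Y;Z) &= I(X;Y|Z) - I(X;Y) \\
         &= I(X;Z|Y) - I(X;Z) \\
         &= I(Y;Z|X) - I(Y;Z)
\end{aligned}
\tag{19}
$$

where $I(X;Y|Z)$ and $I(X;Y)$ are defined as

$$ I(X;Y|Z) = \log_2 \frac{p(X,Y|Z)}{p(X|Z) \, p(Y|Z)} \tag{20} $$

$$ I(X;Y) = \log_2 \frac{p(X,Y)}{p(X) \, p(Y)} \tag{21} $$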



As mentioned before, interaction information can be either
positive or negative for $k \geq 3$, which can be interpreted
as synergy and redundancy, respectively [14]. If two sources
contribute the same information to a destination, redundancy
occurs; this overlap is represented by a negative interaction
information. In the opposite case of synergy and positive
interaction information, two variables $U$ and $V$ contribute
information that does not overlap (see Figure 3).

With icAIS we want to take redundancy and synergy explicitly
into account: we want to add the interaction that occurs
between input and history to the AIS. We already see that
$I(X;Y)$ corresponds to AIS, while Equation (22) shows that
$I(X;Y|Z)$ corresponds to icAIS.



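$$
\begin{aligned}
I(X;Y|Z) &= \log \frac{p(X,Y|Z)}{p(X|Z) \, p(Y|Z)} \\
&\quad \text{substituting } X = x_{n+1},\; Y = x_n^{(k)},\; Z = u_{n+1} \\
&= \log \frac{p(x_{n+1}, x_n^{(k)} | u_{n+1})}{p(x_{n+1} | u_{n+1}) \, p(x_n^{(k)} | u_{n+1})} \\
&= \log \frac{\dfrac{p(x_{n+1}, x_n^{(k)}, u_{n+1})}{p(u_{n+1})}}{\dfrac{p(x_{n+1}, u_{n+1})}{p(u_{n+1})} \cdot \dfrac{p(x_n^{(k)}, u_{n+1})}{p(u_{n+1})}} \\
&= \log \frac{p(x_{n+1}, x_n^{(k)}, u_{n+1}) \, p(u_{n+1})}{p(x_{n+1}, u_{n+1}) \, p(x_n^{(k)}, u_{n+1})} \\
&= \log \frac{p(x_{n+1} | u_{n+1}, x_n^{(k)})}{p(x_{n+1} | u_{n+1})}
\end{aligned}
\tag{22}
$$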

Equation 22 proves that interaction information can be
written as $I = \mathrm{icAIS} - \mathrm{AIS}$, which can be
rearranged to $\mathrm{icAIS} = \mathrm{AIS} + I$, matching
the assumption we made before. As will be shown later in this
thesis, synergy and redundancy are the main issue when
applying AIS to an input-driven system.
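
The three equivalent forms of Eq. (19) can be checked numerically. The following sketch builds an arbitrary joint distribution over three binary variables and evaluates the interaction information in all three ways, using averaged (rather than local) quantities expressed through joint entropies. The helper names and the random distribution are assumptions made for this illustration.

```python
import random
from itertools import product
from math import log2


def marg(joint, axes):
    """Marginal pmf of the variables at positions `axes`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def H(joint, axes):
    """Joint entropy in bits of the variables at positions `axes`."""
    return -sum(p * log2(p) for p in marg(joint, axes).values() if p > 0)


def mi(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return H(joint, a) + H(joint, b) - H(joint, a + b)


def cmi(joint, a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return H(joint, a + c) + H(joint, b + c) - H(joint, a + b + c) - H(joint, c)


def interaction(joint, a, b, c):
    """Eq. (19): I(A; B; C) = I(A; B | C) - I(A; B)."""
    return cmi(joint, a, b, c) - mi(joint, a, b)


# Arbitrary joint pmf over three binary variables (x, y, z) for the check.
rng = random.Random(2)
weights = {o: rng.random() for o in product((0, 1), repeat=3)}
norm = sum(weights.values())
joint = {o: w / norm for o, w in weights.items()}

X, Y, Z = (0,), (1,), (2,)
forms = [
    interaction(joint, X, Y, Z),
    interaction(joint, X, Z, Y),
    interaction(joint, Y, Z, X),
]
```

All three entries of `forms` agree up to floating-point error, and adding `mi(joint, X, Y)` to the first entry recovers `cmi(joint, X, Y, Z)`, mirroring the relation between AIS, interaction information and icAIS stated above.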